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1. Introduction 

Phylogenetic trees inferred from different genes often suggest different evolutionary histories 
for the species from which they have been sampled. This problem of gene tree 'incongruence' 
is widely recognized in molecular systematics [12] . It is particularly relevant to the question of 
the extent to which the history of life on earth can be represented by a phylogenetic tree, rather 
than a complex network of reticulate evolutionary events, such as species hybridization, lateral 
gene transfer (LGT) and endosymbiosis. There are several well-recognized causes of gene tree 
incongruence, most of which apply even in the absence of reticulate evolution, and we begin by 
discussing these. 

Firstly, there is always an expected amount of disagreement under any model of tree-based 
Markovian evolution, simply due to random sampling effects (i.e. the sequences are of finite 
rather than infinite length); moreover, this effect becomes magnified as branches in the tree 
become very short, or very long [21J. Furthermore, regardless of how much data one has, 
certain tree reconstruction methods may exhibit systematic errors, due to phenomena such as 
long branch attraction, or where the model assumed in the analysis differs significantly from 
the process that generated the data ('model mis-specification') [TO] . 

A second basic reason for gene trees to differ in topology from the underlying species tree 
is the population-genetic phenomenon of incomplete lineage sorting. As one traces the history 
of a gene sampled from different extant species back in time, the resulting tree of coalescent 
events can differ from the species tree within which these lineages lie. Recent theoretical work 
[24] based on the multi-species coalescent has shown that the most probable gene tree topology 
can differ from the species tree topology, when the number of taxa is greater than three. By 
contrast, it has long been known that for triplets, the matching topology is the most probable 
topology [22], [30]. Gene tree discordance due to incomplete lineage sorting is well-established 
in many data sets, and has been proposed as an explanation for why, for example, up to 30% of 
the gene trees in the tree ((human, chimp), gorilla) do not support this species relationship [14] . 
Further recent work on lineage sorting has investigated the statistical consistency of building 
species trees from gene trees (or the clades they contain) according to various consensus criteria 
[2], [8]. Additional reasons for gene tree discordance that are still consistent with a species tree 
are gene duplication and loss, and recombination. 

If we turn now to reticulate evolution, it helps to distinguish between two types: hybridization 
(which will include, for example, endosymbiosis, the transfer of a sizable percentage of the 
genome of one species into another or the combination of two genomes into a larger genome) 
and LGT (which is widespread in bacteria and includes the transfer of one or a small number 
of genes from one organism to another). In the case of hybridization, it is clear that no single 
tree can adequately describe the evolution of the taxa under study, and that a network (or a 
set of species trees) is usually a more appropriate representation. 

Hybridization will lead to gene tree incongruence but it leaves a statistically different signature 
to the processes of lineage sorting or sampling error discussed above. For example, in the case of 
lineage sorting, for a given triplet of taxa, we expect that one of the two topologies will be well 
supported, and the other two topologies will have lower but approximately equal support. On 
the other hand, under hybridization, we expect to find support for two of the three topologies 
that reflect the hybridization event but little support for the third). A number of authors have 
explored this question of distinguishing hybridization from lineage sorting [JJ, [T3], [TC], [19] , 

The second type of reticulate evolution, LGT, is the main concern of this paper, and is 
particularly relevant for prokaryotic evolution [6], [9], [18] . A fundamental and much-debated 
question is whether a species tree can be reconstructed from gene trees if genes are randomly 
transferred between the lineages of the tree [I], |29j . One viewpoint holds that if the vast 
majority of genes have been transferred during their history then few gene tree topologies will 
agree with any species tree topology, so it makes little sense to talk about a single species tree 
[3], [7] (however, the same claim could be made, in error, for lineage sorting, as argued by [12 J. 
An alternative view is that one can still recover statistical support for a central species tree even 
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in the presence of relatively high rates of LGT (p], [23], [29]), particularly if these transfers 
occur in a mostly random (rather than concerted) fashion. 

Random models for LGT have been proposed and studied by a number of authors, particularly 
[20] . |23j and [28] . These models are somewhat similar - random LGT events occur 
according to a Poisson process, and the main differences concern whether the rate of transfer 
between two points in the tree is constant or dependent on the phylogenetic distance between 
them. In [23j, the most recent of these papers, Roch and Snir establish a strong transition result, 
which shows how the species tree can be reconstructed from a given (logarithmic) number of 
gene trees, provided that the expected number of LGT events lies below a certain threshold; 
above this threshold, it becomes impossible to distinguish the underlying species tree from 
alternative trees based only on the given gene trees. Our results are complementary to this 
work, as our interest is more in the statistical consistency of species tree reconstruction and, in 
particular the consistency of tree reconstruction of triplets in a larger tree. 

In our paper we begin by setting up some definitions to formally describe the way in which an 
arbitrary sequence of LGTs on a species tree determines the topology of the associated gene tree. 
We consider the combinatorial aspects of this process for any given triplet of leaves. We then 
introduce the model of random LGT events from [20] along with an extension to allow the rates 
of LGT to vary with time and with phylogenetic distance. Under these models, we provide an 
exact analysis of this model on three-taxon and four-taxon trees, showing that for three-taxon 
trees, the species topology always has strictly higher probability than the other two competing 
topologies, but for certain four-taxon trees there is a zone of (weak) statistical inconsistency. 
In Section [6j we consider trees with an arbitrary number of leaves and establish a sufficient 
condition for statistically consistent species tree reconstruction from gene trees. Essentially, 
this condition is an upper bound on the expected number of transfers of certain types in the 
tree. 

We then discuss how estimates of the rate of LGT in a large tree could impact on this 
analysis, and consider the difficult problem of reconstructing a tree when the topology of each 
gene tree can be influenced by both LGT and incomplete lineage sorting processes. Finally, 
we describe and illustrate a simple algorithm for reconstructing a species tree from gene trees 
which may have patchy taxon coverage; here tree reconstruction is statistically consistent if 
each gene evolves under the random model of LGT or incomplete lineage sorting, provided the 
rate of LGT is sufficiently low. We end with a brief discussion and some questions for further 
work. 

2. Combinatorial LGT analysis 

2.1. Definitions. Throughout this paper X will denote a set of species of size n, and A will 
denote a subset of X of size 3. Consider a rooted phylogenetic 'species' tree T, with leaf set 
X, and a vertex p. We will regard T as a 1-dimensional simplicial complex (i.e. the edges as 
intervals) so each 'point' p in T is either a vertex or an element of the interval that corresponds 
to an edge. Consider a coalescence time scale: t : T — > [0, oo) of the tree with the coalescence 
time increasing into the past. Then: 

• t(p) = 4=> p is a leaf, 

• If u is a descendant of v then t(u) < t(v). 

We refer to t{p) as the t-value of p and to t{p) as the timespan of the tree (the time from the 
present to the most recent common ancestor (MRCA) of all the species in X). 

A lateral gene transfer (LGT) onT (or, more briefly a transfer event) is an arc from p € T to 
p'£T where p and p' are contemporaneous, i.e. t(p) = t{p'). We will also assume that neither 
p nor p' are vertices of T (i.e. transfers go between points on the edges of the tree). 

We write a = (p,p ! ) to denote this transfer event and we write t(o~) for the common value of 
t(p) and t(p'). We will assume that no two transfer events occur at exactly the same time. 

Let a = (7i,...,o-fc be a sequence of transfer events (<7j = {pi,^)) arranged in increasing 
t-value order (thus a\ is the most recent transfer event and o~k is the most ancient; moreover, 
the total ordering is well defined by the assumption that no two transfer events took place at 
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the same date). Thus, we will always assume in what follows that: 

< t(o-i) < t(a 2 ) < ■ ■ ■ < t(a k ) < t(p). 

We refer to cr as a transfer sequence on the species tree T. In the biological context, we view 
a as describing the transfer history of a particular gene, so different genes will have different 
associated transfer sequences (including, possibly, the empty transfer sequence, if no transfer 
events occur on T). 

Given a species tree T with the leaf set A and a transfer sequence a = <ri, . . . , on T, we 
obtain an associated gene tree T[a\. To describe this tree more precisely, we assume, as in [20] , 
that an LGT arc from point p to p' replaces the gene that was present on the edge at p' with 
with the transferred gene from p. Thus, if we trace the history of the gene from the present to 
the past (i.e. in increasing coalescence time), each time we encounter an incoming horizontal 
arc into this edge we follow this arc (against the direction of the arc). In this way, the species 
tree, along with any sequence of LGTs describes an associated gene tree. We can formalize 
this mathematically as follows. For a transfer sequence a = a±,...,ak where Oi = (pi,p'i), 
consider the tree T together with a directed edge for each crj placed between pi and p\ for 
each i G {1, . . . , k} and regard this network as a one-dimensional simplicial complex. Now for 
each i G { 1, . . . , k}, delete the interval of this 1-complex immediately above p\ and consider the 
minimal connected subgraph of the resulting complex that contains A. Call this tree T[q\. An 
example is shown in Fig. [TJ 




(i) (ii) 

Figure 1. (i) A rooted binary X— tree T with a sequence a of six transfer 
events, labelled in increasing order into the past; (ii) The tree T[a\. In this 
example, a induces a match for a, b, c and a mismatch for a, b, d. 

• Given the pair T,a = a±,..., define the following sequence of derived A— trees, for 
< r < k: 

T = T, T r = T r ^i[a r ). 

Thus, Tfc = T[a], and for 1 < r < k, T r = T[a\, . . . , ay] is the gene tree obtained from T 
by performing just the r most recent transfers. 

• Given T' G {To,T%, . . . a point p G T' and any non-empty subset Y of A, let 
desy(X",p) denote the subset of Y whose elements are the descendants of p (i.e which 
become separated from the root of T" if p is deleted) . 

3. Triplet analysis 

Consider a species tree T on A. For the rest of this paper, we will assume that T is binary 
(i.e. fully resolved). Let A = {a, b, c} be a subset of A of size 3. We write T\A to denote the 
binary phylogenetic tree on leaf set A that is induced by restricting the leaf set of T to A |26) . 
We refer to T\{a, b, c} as a triplet (species tree) topology, and we write a\bc to denote the triplet 
topology in which the root of this tree separates leaf a from the pair b, c. 
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3.1. Key triplet definitions: For a sequence a of transfer events, we say that: 

• a induces a match for A = {a,b,c} C X if the species tree T and its associated gene 
tree T[a] resolve a,b,c as the same three-taxon tree; i.e., if 

T[a]\A = T\A. 

Otherwise, if T[<r]|.A is one of the other two rooted binary tree topologies on leaf set A, 
we say that a induces a mismatch for A. An example is provided in Fig. [2] 

• For a transfer event a = (p,p'), we say that a is into an A— lineage if desA(T,p') is a 
single element of A (two such transfers are shown in Fig. [2J. 

Now, suppose that a = a%, . . . is a sequence of transfers on T. Consider the resulting 
sequence (T r ; < r < k) of derived trees and let a r = (p r ,p' r )- 

• If desA(T r -i,p' r ) = {x}, for some x 6 A, we say that a r is an A— transfer, and that it 
transfers x. 

• If ay transfers x and des^(TJ._i,p r ) = 0, we say that a r moves x and we refer to o> as 
an A— moving transfer. 

• If a r transfers x, and if des^(T r _i,p r ) = {y}, we say that a r joins x to y and we refer 
to o> as an A— joining transfer. 

Note that any ^4— transfer is either a moving or joining transfer (but not both). Examples of 
A— transfers are shown in Fig. [2] Notice also that the first A— transfer is always a transfer into 
an A— lineage, but later ones need not be. Moreover, a transfer into an A— lineage may not be 
an A— transfer if certain other A— transfers proceed it in some transfer sequence. 



tip) P 




Figure 2. A tree subject to a sequence a = o~i, a?, of two transfer events, which 
induces a match for A, even though each of its component transfers by itself 
would induce a mismatch topology. The transfer event o~\ moves b (and so is 
an A— moving transfer) while o"2 joins c to b (and so is an A— joining transfer). 
Note that if o~\ was removed then o"2 would become an A— moving transfer. 

3.2. Triplet combinatorics under LGT transfers. We now state two combinatorial lemmas 
which will form the basis for the stochastic analysis that follows later. 

Let tA denote the time from the present to the MRCA in T of the closest pair of taxa in A 
(so if T\A = a\bc then tA is the time from the present to the MRCA of b and c). 

The following lemma collects for later reference some observations concerning the impact of 
different types of transfers. 

Lemma 1. Let a_ = o~\, . . . ,o~k be a sequence of transfer events on a rooted binary X—tree T 
and let A = {a, 6, c} C X . 

( a ) If £ induces a mismatch for A, then a must contain an A— transfer with a t— value less 
than tA- 

(b) Moreover, precisely one of the following occurs: 
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(i) <t has no A— transfers. In this case, a induces a match for A. 

(ii) a_ contains at least one A— joining transfer. In this case, if the first such transfer 
in a_ joins x to y then T[a}\A = z\xy where {x, y, z} = A. 

(iii) a has no A— joining transfers, but it has an A— moving transfer with a t— value less 
than tA- In this case, if a r denotes the first such A— moving transfer in a then: 

T[a}\A = T[a r ,...,a k ]\A. 

A second combinatorial lemma, extends case (b) (iii) of Lemma [T] slightly and is used in the 
proof of Theorem [5] In order to state it, we need some further definitions. 

Suppose a = o\, . . . ,ak is a sequence of transfer events on a rooted binary X— tree T with 
t(°~k) < ^A) an d with no A— joining transfers. Construct an associated sequence of trees 
Tq, Tg, . . . , T£ as follows. Set Tq = T and construct by T/ +1 from T- by the following pro- 
cedure. If Oi is not A— moving then set T/ +1 = T[. If o~i = (pijp'j) moves x G A = {a, b, c} then 
let T' i+l be the tree obtained from T[ by: 

(i) deleting all p £ T[ with t(p) < t(cjj), 

(ii) labeling pi by x, 

(iii) For each z G A — {x}, assigning label z to the unique point p z o£T[ that has t{p z ) = i(o"j) 
and z G desA(T-,p z ). 

(iv) We will regard the other leaves in the tree as unlabeled. 

Finally transfer the time dating from Tj across to T.- L+ \. 

Lemma 2. Suppose q_ = <7i, . . . , a/, is a sequence of transfer events on a rooted binary X—tree 
T with t(o~k) < tA, and with no A— joining transfers. Then T[a}\A = Tt\A. 



4. Statistical signals for the central tree under LGT and incomplete lineage 

SORTING 

We begin by recalling the model of LGT described in [2D], which made the following as- 
sumptions: (1) a binary, labeled, rooted and clocklike species tree T is given, as well as all the 
splitting times along this tree; (2) differences between a gene tree and T are only caused by 
LGT events; (3) the transfer rate is homogeneous per gene and unit time; (4) genes are trans- 
ferred independently; (5) one copy of the transferred gene still remains in the donor genome; 
(6) the transferred gene replaces any existing orthologous counterpart in the acceptor genome. 
We refer to this model from [20J as the standard LGT model and we will consider extensions 
of it which relax assumptions (2) and (3). In particular, consider the following relaxation of 
assumption (3) in which transfer events on T occur as a Poisson process through time, in which 
the rate of transfer event from point p on a lineage to a contemporaneous point p' on another 
lineage at time t occurs at the rate f(d(p,p'),t) where f(d, t) is a constant or at least monotone 
non-increasing function in d (though it can vary non-monotonically in t) and d(p,p') is the 
evolutionary distance in the tree between contemporaneous points p and p' in the tree. We call 
this model the extended LGT model. 

Both the standard and extended LGT models induce a well-defined probability distribution 
on gene tree topologies, both for the original set of taxa, and for any subset. 

In the standard LGT model, the number of transfers has a Poisson distribution, with a mean 
equal to the rate of LGT transfer out of any given point in the tree times the sum of the branch 
lengths (phylogenetic diversity) of the tree. We first establish that the Poisson distribution for 
the number of LGT transfers (in total or just the transfers in an A— lineage) still holds for the 
extended LGT model. 

Lemma 3. Under the extended LGT model, the total number of transfers, and the number 
of transfers into an A— lineage (for any given subset A of X of size 3) each have a Poisson 
distribution. 
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Proof. We begin by recalling a general property for any collection (Pi, . . . , P&) of independent 
Poisson processes, where process Pj has intensity rj(t). Let lj(io) count the number of times 
process Pj occurs up to time to, and let Y = Yli=i ^i(*o)- Then Yi(to) has a Poisson distribution 
with mean J^ ri(t)dt. Moreover, since the sum of independent Poisson random variables has 
a Poisson distribution, with a mean equal to the sum of the individual means, it follows that 
Y has a Poisson distribution, with mean Yli=i Io° r i(f)dt. (for background on these stochastic 
results, the reader may wish to consult [13J ) . We apply this result as follows: let N be the 
total number of transfers and Na the number of transfers into an ^—lineage in the tree. We 
can express iV by considering all intervals I between speciation events in T and, within each 
interval, consider all ordered pairs of lineages (restricted to that interval) h,l2- For each such 
ordered triplet i = (I, hjfa), let Pj be the Poisson process of transfers from l\ to I2 (which 
may depend non-homogeneously on time) and Ni denotes the number of these transfers for this 
triplet. Then the Pj are independent processes, and N = J2i=(i i x i 2 ) so N has a Poisson 
distribution. 

Regarding Na, we consider all intervals I between speciation events in T and, within each 
interval, consider all ordered pairs of lineages (restricted to that interval) h,l2, where I2 has 
exactly one of a, 6, c as a descendant. Let Pj be the Poisson process of transfers from l\ to I2 
(which may depend non-homogeneously on time) and let N- denote the number of such transfers 
for this triplet. Note that the Pj are independent processes, and that Na = J2i=(i h i 2 ) ^ii anc ^ 
so again has a Poisson distribution. 

□ 

Although the process of LGT transfers (under the standard or extended) model is a continuous 
time process, there is an associated discrete process that induces an identical distribution on 
gene trees; it is obtained by considering the decomposition described in the proof of Lemma [3j 
Thus with any sequence a = <j\, . . . , of LGT transfers on T, we may associate the discrete 
sequence s = si, . . . , s& where Sj refers to a triple (I, h,fa), in which case T[a] = T[s] (i.e. all that 
matters in determining the resulting tree topology is the sequence of transfers between branches 
of the tree and their relative ordering, not the actual times that they occur). Consequently, each 
such discrete sequence s = s%, . . . , Sf. for T has a positive probability, and under the standard 
LGT model, this probability has a Poisson distribution that just depends on k. 

5. Exact analysis of triplets in three- and four-taxon trees 

For a transfer sequence a generated by this type of LGT process, and any triplet a,b,c £ X, 
we are interested in the probability that a induces a match for a, b, c. 
For three leaves, an exact analysis is straightforward, as we now show. 

Proposition 4. If T has just three taxa, then under the extended IGT model, the probability 
that a transfer sequence induces a match for the three taxa is strictly greater than the probability 
it induces either one of the two mismatch topologies (which have equal probability). 

Proof. When A = {a, b, c} then there are no ^—moving transfers, and a transfer a is A— joining 
if and only if t(o~) < tA', let iV denote the number of such transfers. We can express iV as the 
sum of six random variables, which counts the number of transfers from the lineage x to lineage 
y (for x, y G A,x 7^ y). By the assumptions of the model, and Lemma [3j iV has a Poisson 
distribution with some fixed mean m. Now if = we obtain a matching topology for the 
three taxa, while if iV > then, by Lemma [TJ the topology of the tree is determined by the 
first transfer (since it is, by necessity, A— joining) and under the assumption of the model (in 
particular invoking the property that / is non-increasing), the probability that this first transfer 
is between two of the most closely related taxa is at least |. Thus, the probability of a matching 
topology for the three taxa is at least: 

P(N = 0) + \f(N > 0) = e~ m + \(1 - e~ m ), 

while for either mismatched topology, the probability is equal to the other mismatch topology 
and is no more than g(l — e~ m ). This completes the proof. □ 
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5.1. Four-taxon case. The analysis of the distribution of triplet gene tree topologies generated 
by random LGT transfers on a four-taxon tree is considerably more interesting than the three- 
taxon case. Note that with four taxa, there are two rooted binary tree shapes - the 'fork-shaped' 
tree (with two cherries, as shown in Fig. [3]) and the pectinate tree (with one cherry). We will 
provide an exact analysis for the first of these tree shapes under the standard LGT model, as 
this suffices to demonstrate a zone of statistical inconsistency, though we discuss briefly how 
an analogous (but more complex) analysis could be carried out for the other rooted tree shape. 
Our analysis requires no approximations, nor any imposition of an upper bound on the number 
or rate of LGT transfers. It relies on associating a random walk on a six-cycle to the LGT 
process. 

We begin with some definitions. For four leaves x,y,z,w, we write (x,y;w,z) to denote the 
rooted binary tree, with the leaves x, y on one side of the root and w, z on the other, and with 
the MRCA of x, y at a fixed time t^ xy y and the MRCA w, z at a later fixed time t^ w>z y. Thus 
(x, y; w, z) = (y, x; w, z) = (y, x; z, w) = (x, y; z, w), but no other symmetries hold. For example, 
the tree (o, *; 6, c) is shown in Fig. and the tree (6, c; a, *) in Fig. [3](ii) . Here * refers to the 
fourth taxon, the identity of which plays no role when we come to consider the topology of the 
triple a, b, c. 

We now state the main result of this section. 

Theorem 5. Suppose T is a rooted four-taxon tree and A = {a, b, c} is a subset of the leaf 
set X of T , and suppose that T\A = a\bc (thus T\A is a tree of type r' a or r a in Fig. [3]). Let 
F{x\yz) = F(T\A = x\yz) under the standard LGT model of [20], where {x,y,z} = A. 

(i) Suppose T is of type r a . Then for any t— value for the MRCA of (a, *), there is a suf- 
ficiently large t-value for the MRCA of (b, c), for which the matching gene tree topology 
a\bc has a lower probability than either of the alternative mismatch topologies. 

More precisely, for [i = |A£/ 0i *} and B = 3A(t/(,,c} ~~ *{a,*}); we have: 

(1) F(a\bc) = \[l~ e" 7/i (l - e" 2 ^ - e~ B (l + e~ 2 "))] 

andF{b\ac) = F(c\ab); moreover, F(a\bc) < F(b\ac) = F(c\ab) if and only ift^ c y — £{ a ,*} 
is greater than ^ ln( ^^_^ ). 

(ii) Suppose T is of type r' a . Then for any t— values for the MRCAs of (a, *) and (b, c), the 
matching gene tree topology a\bc has a higher probability than either of the alternative 
mismatch topologies. 

More precisely, for r' a and \i = |Air 6 c \, and B = %{\ts a ^\ — tj b c y), we have: 

(2) F(a\bc) = \[l + e^(l + e" 2 ^ - e~ B (l - e^))] . 

In this case, F(a\bc) > F(b\ac)(= F(c\ab)) for all values o/t/ 0) *\ and £{6 iC } — £{ a ,*}- 

Proof. First observe that, by symmetry, we have: 

(3) P(6|oc) =P(c|o6), 

under the standard LGT model (indeed this holds here even under the extended LGT model). 
Consider the cyclic graph whose nodes are the six trees: 

T a = (a,*;b,c), r' a = (b,c;a,*), n = (b,*;a,c), t{, = (a, c; b, *), 

t c = (c,*;a,6), t' c = (a,c;c,*), 

which are connected into a cycle as shown in Fig. |4| 

Now, let Zt : t > be a continuous-time symmetric random walk on this 6-cycle graph, where 
the instantaneous rate of moving from one node to either given neighboring node is 1. Let p r (t) 
r = 0, 1, 2, 3 be the probability that, after running the process for time t, this Markov process 
is at a node that is graph distance r (by the shortest path) from its initial state. 
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(i) 



(ii) 



Figure 3. (i) The species tree topology r a = (a,*; b, c) for which the standard 
random LGT model confers, for any value of £{ a ,*} an d a sufficiently large value of 
i{f, )C }, a higher probability for each of the the gene tree mismatch topologies b\ac 
and c\ab than for the matching species tree topology a\bc. (ii) The species tree 
topology r' a = (b, c;a, *) which, in contrast to (i), always has higher probability 
under the standard LGT model for a matching gene topology a\bc than for either 
mismatch topology. 



n 




a * b 





a c * 



a b * 





* b a 



\ A,/ 




Figure 4. Conditional on there being no ^—joining transfers between the 
present and tA, the topology of the tree sequence T[ induced by A— moving 
transfers in this interval is equivalent to a simple random walk on the six-cycle 
graph shown. 

Note that p = p(t) = [po(t),Pi{t),P2(t),P3(t)] t satisfies the system of first-order linear differ- 
ential equations: 

d 
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where B is the 4x4 tridiagonal matrix: 



-2 


1 








2 


-2 


1 








1 


-2 


2 








1 


-2 



and so: 

(4) p(t) = exp(Bt)p(0) where p(0) = [1, 0, 0, 0]'. 

The eigenvalues of B are 0, -1, -3, -4, and the random walk on the 6-cycle is a reversible 
Markov process with uniform equilibrium frequency, and so 

lim p(i) = [1/6, 1/3, 1/3, 1/6]*, 

t— >oo 

and each component Pj(t) is of the form: 

Pj(t) = a,j + bje^ + Cje~ 3 * + dje~ u , 

for constants aj, . . . , dj that are determined by the eigenvectors of B. 

Using standard matrix diagonalization techniques from linear algebra, we obtain the following 
solution to Eqn. Q: 



P(*) 



1 



■1 





1 








e"* 




e -3t 




e -4i 







1 -1 -1 

1-11 

From this, one immediately obtains the following result, which will be required later. 
Lemma 6. For all t > 0: 



Pi(t)-2p 3 {t) 



e~ 3t > 0, 



and 



+ e 



> 0. 



2p (t) - P2 (t) = 

We return now to the proof of Theorem [5] We will establish part (i), and indicate how the 
proof of part (ii) follows by a directly analogous argument. 

Let £ denote the event that the random sequence of transfer events a generated by the model 
induces a match for A. Let J denote the number of A— joining transfers between t = and 
t = £{ a) *}. Then J has a Poisson distribution with mean 2At{ a ! „} = 6/i, since at any moment 
in the interval [0, t{ a) *}], there are four lineages, three of which lead to leaves in A (therefore, 
for any x £ A, the rate of transfer from that x— lineage to any lineages that also lead to A is 
A • (2/3)). Thus the cumulative rate of an A— joining transfer is 3A • (2/3) = 2A. Consequently: 

(5) P(J > 0) = 1 - e-^. 
Now, Lemma [l] (part (b) (ii)) implies that: 

(6) P(£|J>0) = ^, 



and by the law of total probability, Eqns. §5§ and ^ give: 
(7) ¥(£) = 1(1 - e~^) + e' 6fl F(£\J 



0). 



Now the A— moving transfers between t = and t = i{ a ,*} constitute a continuous-time 
Poisson process for which the rate at which any given x E A is moved is |A. Note that this 
process is independent of J since the source point of an A— joining transfer is always on a 
different type of lineage (having an element of A as a descendant) from an A— moving transfer. 
As the A— moving process proceeds in time (from t = to t = i{ aj *}), the resulting sequence 
of trees T' k described in the preamble to Lemma [2] corresponds to a simple symmetric random 
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walk on the nodes of the 6-cycle shown in Fig. |4j starting with tree r a at time t = 0, and where 
the rate of moving from one node to any particular neighboring node is ^A. At time t = s, the 
length of two pendant edges of the r— tree will be t{ a ,*} — s while the other two pendant edges 
have a larger length of i{f> iC } — s > so we stop the process when the length of the shorter pair of 
pendant edges reaches zero (i.e. at t = i{ a ,*})- Note that the length of the pendant edges does 
not affect the transition process under the standard LGT model, so we can indeed view it as a 
discrete-state random walk on six states (rather than on a continuum of states). 

Lemma [2] now ensures that if a[ is the sequence of A— moving transfers between t = to 
t = t{ a ^y then T[a'} resolves a, b, c in the same way as the tree n does, where t% is the state 
of the random walk on the 6-cycle at time t{ a ^y At time t = the random walk on the 

6-cycle is at one of the following nodes: 

• r' a , in which case T[a]|^4 = a\bc (with probability 1); this does not depend on any transfer 
events that may occur after t^ a ^y, 

• T a , in which case T[cr]|yl = a\bc 

— with probability 1 if there is no transfer event between t^ a ^y and t{& iC }, or 

— with probability | if there is at least one transfer event between i{ a) *} and i{b )C }j 

• or r c , in which case T[a]|^4 = a\bc with probability | if there is at least one transfer 
event between i{ ai *} and t^ bc y; 

• r' b or t' c , in which case T[ct]|j4 7^ a\bc regardless of any further transfers. 

In this case analysis, the probability factor | arises from Lemma [TJ part b(ii). Note also that 
the probability that there is at least one transfer event between t{ a ,*} and t{b,c} nas probability 
1 — e~ B , and this event is independent of the random walk on the 6-cycle. 
Consequently, by independence, and by combining these cases we obtain: 

F(£\J = 0) =j>3( M ) • 1 + Po (aO • (e~ B + ^(1- e- B ))+p 2 (fi) • |(1 - e' B ) + Pl (fj,) ■ 0. 

which, combined with Eqn. ([7]) and Lemma [6j establishes Eqn. Q. The remainder of part (i) 
is justified by Eqn. ^ and straightforward algebra to determine when the expression in Eqn.Q 
is lower than the probability of a specific mismatch topology on A. 

For the proof of part (ii), an analogous argument shows that for T\{a,b,c} of type r' a and 
H = jjrAi{b iC } and B = 2>{\t^ a ^ — t^ )C y), Eqn. ^ still holds, that is: 

P(£) = hl- e" 6 ^) + e~ 6M P(£| J = 0). 
For the last term, we find that: 

(8) P(£\J = 0)=Po(jJt) + (pi(M) +P3(M)) • (1 ~l ^ +P3O0 • e- B + P2 (fx) ■ 

from which Eqn. ^ now follows, by Lemma [HJ The remainder of part (ii) is justified by 
Eqn. ([3]), and straightforward algebra shows that the expression in Eqn. @ is never lower than 
the probability of a specific mismatch topology on A. 

□ 



5.2. Statistical inconsistency? Part (i) of Theorem [5] shows that the species tree topology 
for three taxa inside a larger tree can have the lowest probability among the three possible gene 
tree topologies on those three taxa, under the standard LGT model (see Fig. 5). This is in 
sharp contrast to what occurs with incomplete lineage sorting, where the most probable gene 
tree for three taxa matches the species tree topology for those three taxa, regardless of what 
other taxa are present, and how they are arranged in the species tree. Thus, in the setting of 
Theorem , estimating the species tree for a set A of three taxa from the frequency of triplet 
gene trees will be statistically inconsistent (it will converge on an incorrect tree). However, this 
does not imply that one cannot estimate the species tree from the probability distribution of 
all gene trees topologies (and for all subsets A of size three from X). Moreover, if we use the 
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Figure 5 . Plot of Equation 1 in the statement of Theorem [5] Note that the 
probability F(a\bc) for a match (shown on the z-axis) is less than | in the bottom 
right-hand side of the figure. 

R* tree reconstruction method for four-taxon tree having the clades {a, b}, {c, d}, Theorem [5] 
(parts (i) and (ii)) shows that we will always return a tree that includes at least one of these 
clades, and no contradictory clades. Thus, the R* method would, in this case, be only weakly 
inconsistent (i.e. it would return a tree that is either equal to or is resolved by the species tree, 
rather than being positively misleading). 




5.3. The other tree shape on four taxa. One could perform a similar analysis for the 12 
rooted binary trees that have the four leaves {a, b, c, *} and just a single cherry (rather than two 
cherries as above). In this case, the associated transition graph consists of a 12-cycle, together 
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with three additional edges - obtained by placing an edge between ((xy)z)* and ((xy)*)z for 
each of the three choices of {x, y} from {a, b, c}. This graph is shown in Fig. [6| The analysis of 
the probability of a matching topology for a, b, c under the random LGT model (depending on 
the position of the * lineage) could be carried out by a similar, albeit more complex, analysis 
to that for the simpler 6-cycle graph, but this is beyond the scope of the current paper. 

6. General case, trees with ?i-taxa 

We now include the three- and four-taxon results into a more comprehensive statement con- 
cerning the statistical consistency of species tree reconstruction under the LGT model, for an 
arbitrary number of taxa. 

Theorem 7. Consider the standard or extended LGT model on a rooted binary phylogenetic 
X—tree. 

(i) If T has just three taxa, then under either model the probability that a transfer sequence 
induces a match for the three taxa is strictly greater than the probability that it induces 
either one of the two mismatch topologies (which have equal probability). 

(ii) A four-taxon tree and branch lengths exist for which the model can assign higher proba- 
bility to a particular mismatch topology for some triplet, than for a match, even under 
the standard LGT model of |20| . 

(iii) Regardless of the number of taxa in the tree and the branch lengths if, for some subset 
A of taxa of size 3, the expected total number of transfers into an A— lineage (for the 
particular gene) is no more than 0.69 in the extended model, and no more than 1.1 4 in 
the standard LGT model, then the probability of a topology match is strictly greater than 
the probability of either of the mismatch topologies. 

(iv) When LGT rates ensure that condition (iii) holds for every subset of A of taxa of size 
3, there is a polynomial-time method for reconstructing the species tree from the gene 
trees which is statistically consistent under the model, as the number of independently 
generated gene trees tends to infinity. 



Proof. Parts (i) and (ii) are established by Proposition [4] and Theorem [5] respectively. 

Proof of part (iii): Let N A denote the total number of transfer events of the gene in the tree 
into an A— lineage. By Lemma [3j N A has a Poisson distribution with some mean m. Then for 
any triple a,b,c, suppose that T\{a,b, c} = a\bc. Then if N A = then there is a match with 
probability 1. Thus, in the extended model, if m < ln(2) ~ 0.69 then the probability of a match 
is at least F(N A = 0) = e~ m > 0.5. This establishes the first claim in part (iii). 

For the second claim in part (iii), consider the standard LGT model. In the Appendix we 
establish the following claim by means of a coupling-style argument. 

Claim 1: Consider a sequence of transfer events under the standard LGT model. Then, 
conditional on the event that N A = 1, the probability p that this sequence of transfers induces 
a match for A is greater or equal to the probability q of inducing a specific mismatch topology 
(say c\ab) for A. 

Now, under the standard LGT model, the probability of a match is at least: 

(9) F(N A = 0)+p-F(N A = l), 

while the probability of a specific mismatch topology (say c| ah) is at most 

(10) q-F{N A = l)+F(N A >l), 

and from Claim 1, p > q so the difference obtained by subtracting the mismatch topology 
probability (10) from the matching topology probability ^ is at least: 



F(N A = 0) - F(N A > 1) = e~ m - (1 - e~ m - me" m ) = e" m (2 + m) - 1, 
and the term on the right is strictly positive for m < 1.14, as claimed. 
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Proof of part (iv): We can apply the same argument used by [8], who showed that the 
(polynomial time) R* tree reconstruction method (based on triplet topologies) is statistically 
consistent under models of incomplete lineage sorting. Here, we are dealing with LGT rather 
than incomplete lineage sorting, but the only property required of either model in order to 
ensure the statistical consistency of the R* method is that for each triplet A, the probability 
that the gene tree matches the species tree topology restricted to A has a probability that is 
greater (by some fixed e > 0) than either of the other two topologies (in the case of incomplete 
lineage sorting, the two alternative topologies have equal probability, but this may not be the 
case under LGT; however, this is not essential to prove consistency). □ 

6.1. Rates of LGT. Theorem [j] (part (iii)) requires a small expected number of transfers 
into an A— lineage for any subset A. The question arises as to how this expected number 
would compare with the total expected number of transfers in the tree. The total number of 
LGT transfers in the tree Nt ot has a Poisson distribution; under the standard LGT model this 
distribution has mean r • L(T), where r is the rate of LGT transfer out of any given lineage at 
any given time and L(T) is the phylogenetic diversity of T (the sum of the lengths of all its 
branches). For a Yule (pure-birth) tree with n taxa and timespan t, if the speciation rate is set 
equal to its expected value, then from [27] . L(T) has expected value: 

On the other hand, the rate of transfers into an A— lineage at any time is at most 3r and 
so the expected number of transfers into an A— lineage is, at most, 3rt. Thus, if the expected 
number of LGT transfers in the entire tree is G then the expected number of transfers into 
an A— lineage in a tree with phylogenetic diversity equal to its expected value under the Yule 
model is, at most: 

(12) 3Hn/2) G 

(12) 7^2T 

For example, if n = 200 and G < 10 (on average every gene is transferred at most 10 times 
on the tree) then (12) takes the value 0.7, which is within the 1.14 bound of Theorem [7] (part 



(iii)). We note that in one study, it was suggested that the average number of times each gene 
has been transferred might be around 1.1 [6j, so the condition imposed in Theorem [5] (iii) may 
not be unreasonable. In the recent paper by p], an average rate across bacterial genomes was 
estimated at between .02 to .04 LGTs per branch of the tree. Thus, for n = 200, G is about 
8-16 events, where the upper end of these results include some cases of extremely high rates of 
LGT in bacteria. 

6.2. LGT and incomplete lineage sorting. We have described sufficient conditions for the 
R* tree- reconstruction method to be a statistically consistent estimator of a species tree topology 
under various LGT models. Moreover, it is known that the R* method is also a statistically 
consistent estimator of the species tree topology under lineage sorting and without any non- 
trivial restrictions on branch lengths [8]. It follows that if each topology of each gene tree is 
determined by either LGT acting on the species tree topology or by incomplete lineage sorting 
(but not by both processes) then the R* tree reconstruction method can be a statistically 
consistent estimator of species tree topology (under conditions where it would be for the genes 
undergoing LGT). 

One could also ask what happens when the two processes are combined - that is, if we 
allow the ancestry of a gene to follow transfers and to coalesce within lineages, under the usual 
coalescent process. 

While it is possible to extend the earlier results a little in this direction (results not shown), 
the applicability of the results is somewhat limited for the following reason. 

LGT is especially prevalent in haploid, largely asexual taxa with limited recombination, such 
as bacteria and archaea and, in this case, although incomplete lineage sorting may apply in 
considering how genetic lineages coalesce in the species tree, there is an important difference to 
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diploid sexual taxa. Namely, in this latter case, the coalescent history of each gene sampled from 
an extant individual in each species represents an essentially independent sample from the same 
(multi-species coalescent) process, while in taxa with little or no recombination, the lineages of 
all genes follow the same ancestral trajectory, apart from LGT events. This complicates any 
statistical analysis based on assuming that the genes are independent samples from a common 
process and leads to some delicate statistical issues in attempting to analyse data with such a 
mixed mode. We defer this issue for future consideration. 

7. Missing taxa: Primordial tree consensus 

One obstacle to applying the R* construction is that many genes may not be present across 
all taxa [53]. This may be due to a variety of factors, including gene loss or gene conversion, or 
simply because certain genes have not been sequenced yet. 

Consequently, we describe a slight extension of the R* consensus approach to handle this 
situation. For any three taxa a,b,c £ X, let G(a, b, c) denote the set of genes that are present 
in all three taxa a, b, c. We will assume that the pattern of taxon coverage is sufficiently dense 
that the following condition holds: 

(13) G(a, b, c) > 0, for all a, b, c G X. 

To give some indication of how much coverage this requires, if rii denotes the number of taxa 
containing gene i G {1, . . . , k}, and n is the total number of taxa, then we require: 

k 

y^njjrii - l)(nj - 2) > n(n - l)(n - 2), 
i=i 

(from the proof of Theorem 1 of [25]), which, in turn, implies the weaker but simpler, inequality: 

k 

J^fi > 1 f o r fi = m/n. 
i=i 

We consider the following simple extension of the R* consensus method to the setting of 
partial taxon coverage. For o, b, c G X, let cf(a\bc) denote the proportion of genes in G(a, b, c) 
which resolve a, b, c as the triplet topology a\bc. 

Lemma 8. The set 

C = {A C X : cf(b\aa') > max{cf(a\a'b), cf(a'\ab)} for all a, a' G A, a / a' and b G X - A} 
forms a hierarchy, and can be constructed in time that is polynomial in n. 

This procedure has been implemented in the phylogenetic software package Dendroscope 3 
(version 3.2.2) |17] as the 'primordial tree' consensus method. 

Now, under a model in which the pattern of gene presence and absence is a random process 
(as in |25| . where the presence or absence of a gene for each taxon is an independent stochastic 
process) and provided this process is independent of the LGT process, the results on the statis- 
tical consistency of species tree reconstruction will carry over. The same also holds if we were 
to consider incomplete lineage sorting rather than LGT. 

Fig. [7] shows a tree constructed in this way from the recent bacterial data set of [1], for 
which the authors estimated that the rate of LGT was fairly high (but not too high to erase 
all phylogenetic signal). The input was 1338 rooted gene trees on variable label sets for the 
Actinobacteria phylum of their study (the clade at the upper left in Fig. 3 of p]). 

The 'unrooted' gene trees (from the website of the authors) were midpoint-rooted using the 
phylogenetic program 'Phylip'. The authors in [1] implemented a complex procedure as part 
of their 'Prunier' software, to look at all possible roofings of the input gene trees, selecting the 
ones that minimized the number of LGT events. Rooting them in a new way here provides an 
independent analysis of these data. 

The output tree is fairly close to the tree suggested by pQ . The biggest difference is that the 
primordial tree roots it at R. xylanophilus, which is a very long branch in their Fig. 3. The 
authors of pQ were concerned about long branch attraction in these data. 
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FIGURE 7. A tree constructed from 1338 rooted gene trees on overlapping taxon 
sets for the Actinobacteria phylum, which indicated high rates of LGT in the 
study by pQ. 



As a second application to a quite different data set (Eukaryotes) and a different process 
(incomplete lineage sorting rather than LGT), Fig. [7] shows a tree constructed by the same 
method, from 986 rooted gene trees from chromosome 3 of 11 taxa of the genus Oryza (rice and 
its relatives). Some trees contained all 11 taxa but most did not, however the pattern of taxon 



coverage is sufficiently dense that condition (13) does hold. While this data set is unlikely to 



exhibit LGT at anywhere near the rate of the previous one, gene flow persisting for some time 
after speciation would essentially show the same pattern as LGT. Moreover, incomplete lineage 
sorting is likely quite extensive across the genome at several nodes in the tree ([32J, |33j); Zwickl 
et al., in prep.). 



-0._nivara_AA 

- 0._sativai_AA 

- 0._rufipogon_AA 

- 0._sativaj_AA 
-0._g!aberrimaM_AA 
-0._barthii_AA 
- 0._punctata_BB 
-0._minuta_BB 
-0._officinaiis_CC 
-0._minuta_CC 

- 0._brachyantha_FF 



Figure 8. A tree constructed from 986 rooted gene trees from chromosome 3 
for taxa of the genus Oryza for which incomplete lineage sorting (rather than 
LGT) is a likely cause of gene tree discordance. 



7.1. Questions for future work. It would be interesting to extend the scope of Theorem [5] 
to include the other four-taxon tree-shape (under the standard LGT model), as well as to ana- 
lyze both tree shapes under the extended LGT model. Our analysis also raises some intriguing 
statistical questions: Is strong statistical inconsistency possible (for R* or perhaps other meth- 
ods)? Is the species tree identifiable from the probability distribution on gene trees, regardless 
of the LGT rate? If the rate of LGT decreases sufficiently fast with phylogenetic distance in the 
tree, then is statistical consistency restored for the R* method? And what can be said regarding 
statistical issues arising when we combine LGT and incomplete lineage sorting and phylogenetic 
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sampling error? Further research on these questions would help us better understand the extent 
to which signal for a species tree can be recovered above the 'noise' of random processes that 
can cause gene trees to conflict with the species tree. 
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8. Appendix: Proof of Claim 1 (from proof of Theorem [7]) 

Consider a sequence a of one or more transfer events, which contains exactly one transfer 
a r = cr(p,p') that is into an A— lineage and for which T[oJ|^4 = c\ab. We will associate with a 
another sequence of transfer events er' which induces a match for A, and which is identical to a 
except that a is substituted by a particular alternative transfer a' . 

In case a is an A— joining transfer (in which case it joins a to b, or 6 to a in order for 
r[a]|^4 = c\ab) we replace a with the transfer a' (at the same time-instant) that: 

(i) joins b to c if a joins a to b; 

(ii) joins c to b if a joins b to a. 

These two cases are illustrated in Fig. [9j 




(iii) (iv) (v) 

Figure 9. For cases (i)-(v) in the proof of Claim 1, the transfer a is shown 
as a solid horizontal arrow, and the associated transfer a 1 is shown as a dashed 
horizontal arrow. 

Otherwise, in case a = (p,p') is an A— moving transfer, either: 

(iii) des A (T,p') = {a}, or 

(iv) des A (T,p') = {&}, or 

(v) de SA (T,p') = {c}. 

Consider case (iii). Recalling that a is the r-th transfer in a, in order for a to induce the 
topology c\ab, there must be a vertex v in the derived trees T r _i for which des^(T r _i, v) = {a, b}; 
moreover, since a is the only transfer into an A— lineage, v must lie on the the path in T between 
b and the MRCA of b, c. In this case, we take a' = (p,Pb) where pb is the unique point in T 
with t(pb) = t{p) and which has des A (T,pb) = {b}. 
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In case (iv), in order for a to induce the topology c\ab there must be a vertex v in the 
derived trees T r _i for which desA(T r -i, v) = {a, &}; moreover, since a is the only transfer 
into an A— lineage, v must lie on the path in T between the a and the MRCA of A. In this 
case, we take a' = {p,p a ) where p a is the unique point in T with t(p a ) = t(p) and which has 
des A (T, p a ) = {a}. 

Case (v) is similar to case (ii) except that we can take v to be the MRCA of {a, b}, and (as 
in case (ii)) we take a' = (p,p a ) where p a is the unique point in T with t(p a ) = t(p) and which 
has des^(T,p a ) = {a}. 

Cases (iii)-(v) are also shown in Fig. [9j 

In all five cases (i)-(v), replacing a by a' in the sequence a results in the modified sequence 
a 1 that induces a match for A] moreover the association a i— > a' is one-to-one on the set of 
transfer sequences with Na = 1 and which induce c\ab and a\bc respectively. It then follows 
that p > q under the standard LGT model, as claimed. 



